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Abstract 

Standard Gaussian graphical models (GGMs) implicitly assume that the conditional inde- 
pendence among variables is common to all observations in the sample. However, in practice, 
observations are usually collected form heterogeneous populations where such assumption is 
not satisfied, leading in turn to nonlinear relationships among variables. To tackle these prob- 
lems we explore mixtures of GGMs; in particular, we consider both infinite mixture models 
of GGMs and infinite hidden Markov models with GGM emission distributions. Such mod- 
els allow us to divide a heterogeneous population into homogenous groups, with each cluster 
having its own conditional independence structure. The main advantage of considering infinite 
mixtures is that they allow us easily to estimate the number of number of subpopulations in the 
sample. As an illustration, we study the trends in exchange rate fluctuations in the pre-Euro 
era. This example demonstrates that the models are very flexible while providing extremely 
interesting interesting insights into real-life applications. 

Keywords: Covariance selection; Dirichlet process; Gaussian graphical model; Hidden 
Markov model, Nonparametric Bayes inference 

1 Introduction 

Problems with small sample sizes and large number of unknown parameters represent one of the 
most challenging areas of current statistical research. Graphical models deal with this type of ill- 
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posed problems by enforcing sparsity in the conditional dependence structure among outcomes. 
More specifically, given a random vector X = (Xi, . . . , Xp) e W\ a graphical model for X encodes 
the conditional independence relationships between its components through a p-vertex graph G, 
such that vertex i represents component Z,- and the lack of an edge between nodes i and j indi- 
cates that variables i and j are conditionally independent. In particular, Gaussian graphical models 
(GGMs), also known as covariance selection models ( |Dempster| |1972| ), have become extremely 
popular in applications ranging from genetics ( [West et aLj [2001^ |Castelo & Roveratol |2006[ ) to 
econometrics and finance ( [Carvalho & West[ |2007| [Dobra et aL| |2008| ). GGMs assume that the 
joint distribution of X follows a multivariate Gaussian distribution, and therefore conditional inde- 
pendence among variables can be enforced by setting to zero the appropriate off-diagonal elements 
of the inverse covariance (precision) matrix. 

One important shortcoming of GGMs is that they implicitly assume a linear relationship be- 
tween variables. Copulas have been used in the context graphical models to address nonlinearities. 



For example, [Bedford & Cooke] ( [2002[ ) decompose the joint distribution of X using pairwise copu- 
las; however, the resulting models are computationally difficult to fit, specially when p grows. An 
alternative to copulas is to model non-linearities through mixtures of GGMs. Countable mixture 
models explain nonlinearities in the conditional expectations as a consequence of hetherogeneity 



of the population, and can therefore be interpreted as providing adaptive local linear fits (Miiller 
etalj [T996t [Rodriguez et all [20091 ). 



As a motivation for investigating mixtures of GGMs, consider the analysis of gene expression 
data. GGMs have been often used in the context of microarray data, where the graph encoding the 



conditional dependence structure provides information about expression pathways (Dobra et al. 



2004, Friedman, 2004, Castelo & Roveratol 2006). The implicit assumptions in these models is 



that the expression pathways are the same for all individuals/tissues in the sample and that expres- 
sion levels on different genes are linearly related, which might not be justified if the underlying 
population is heterogeneous. Similarly, when studying the relationship between economic vari- 
ables such as exchange rates, graphical models allow us to identify groups of countries that form 
economic blocks and understand how these blocks interact with each other. However, as trade 
patterns evolve, we expect that both the block membership and the modes in which countries in- 
teract might change, making the constant-graph assumption unrealistic. In both of these setting, 
mixtures of GGMs not only provide us with a tool to induce sparsity in heterogeneous samples, 
but also generate highly interpretable models. 

The major challenges in implementing mixtures of GGMs are computational, and relate both 
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to the determination of the underlying graph associated with each component in the mixture and 
to the estimation of number of components. Indeed, it is well known that the number of possible 
partitions for the data grows exponentially, a problem that is compounded when we desire to also 
estimate the number of components in the mixture and the graphical structure corresponding to 



each cluster. Work in finite mixtures of graphical models goes back at least to Thiesson et al. 



( 1997 ), who fixed the number of components in the mixture and developed a search algorithm that 
uses a modified Cheeseman-Stutz approximation to the marginal likelihood coupled with EM steps 
to estimate component-specific parameters. However, to the best of our knowledge, the problem 
of determining the number of components in mixtures of graphical models has not been properly 
addressed before. 

In this paper, we present the first fully Bayesian approach to inference in nonparametric mix- 
tures and infinite hidden Markov models with Gaussian graphical models as kernel/emission dis- 



tributions. Using infinite mixture models provides full support in the space of distributions (Lo 



1984, Ongaro & Cattaneo 2004[ ), and allows us to automatically deal with an unknown number 



of components/states within a simple computational framework. The hidden Markov models we 
discuss allow for the graph encoding the conditional independence structure of the data to change 
over time, an important feature that has been missing in multivariate time series models employ- 
ing graphical models ( [Carvalho & West[ |2007[ |Wang & West[ |2009| ). Since the paper focuses on 
models that have a Polya urn representation, we construct marginal samplers ( |Neal 2000) that 
explicitly integrate out the mean and variance of the individual GGMs. For problems where the 
the main interest is either prediction or inference the partition structure and/or the graphical struc- 
ture associated with each component, this approach greatly reduces computational complexity by 
avoiding the explicit representation the mean and variance of the different components/states. Al- 
though the paper develops models based on GGMs, the approaches we discuss are not restricted 
to multivariate continuous outcomes, but can be extended to incorporate combinations of binary, 
ordinal and continuous variables by introducing latent auxiliary variables. 

To simplify our exposition we begin by reviewing Bayesian approaches to inference in Gaus- 
sian graphical models in Section [2] and introducing Dirichlet process mixtures of Gaussian graph- 
ical models in Section [3] and |4j We then move to discuss more general nonparametric mixture 
models in Section [5| including species sampling mixtures of GGMs and infinite hidden Markov 
models with GGM emission distributions. These models are illustrated in Section |6] with a sim- 
ulated and a real- world dataset. Finally, we conclude in Section [7] with a discussion of possible 
extensions and future research directions. 



3 



2 Bayesian Framework for GGMs 



Let X = Xyhe the vector of observed variables, where V = {1,2, ...,p]. We assume that X follows 
a multivariate Normal distribution p(X\/u,K) = Np{fi,K~^) with mean vector e W and p x p 
precision matrix K = (Kjj). We consider the set of decomposable graphs associated with V. 
The Gaussian graphical model (GGM) associated with a graph G = {V, E) 6 Qv is obtained by 



setting to zero the elements of K corresponding with missing edges in G ( |Dempster| |1972| ). The 
absence of the edge (z, j) e {V xV)\E implies Kjj 
are conditionally independent given Xv\[ij], i.e. 



Kji = which in turn imphes that Xi and Xj 



Xj ii- Xj \ Xv\{ij]. 

The precision matrix K belongs to the cone Pg of the symmetric positive definite matrices with 



entries equal to zero for all e (V xV)\ E (Atay-Kayis & Massam 2005). The conditional 
dependence relationships implied by G induce the following factorization of the joint distribution 
of X dPawid & Lauritzeni[T993l ): 

UcecP(^c\l^c,Kc) 



p{X\n,K,G) 



(1) 



where C denotes the cliques of G and S denotes separators of G. For an index set Vo c V, is 
the subvector of // corresponding to the entries in Vq, while Ky^ = ((K~^)vgy^ ■ We remark that the 
subgraph Gc = (C, Ec), Ec = {{i, j) e E : i, j e C), associated with a clique C e C is complete, 
that is, there is no edge missing from it. Similarly, the subgraph Gs associated with a separator 
S e Sis also complete. 



2.1 Prior specification 

We consider the following joint prior distribution for jj. and K: 

p(lJ,K\G)=pQu\K,G)p(K\G), 



(2) 



where, conditional on K, the prior for the mean is piiAK, G) = Np(po, {noK) with p.^ e W and 
riQ > 0. The prior for the precision matrix p(K\G) = WdSo, Dq) is a G-Wishart distribution with 



density (Roverato. 2002, Atay-Kayis & Massam, 2005, Letac & Massam 2007) 
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/g(5o,^o) 



(det K) 



(5o-2)/2 , 



exp^-^<iS:,A)) 



(3) 



with respect to the Lebesgue measure on Pq. Here {B,C) = tr(5^C) denotes the trace inner 
product, piaconnis & Ylvisaker| ( |1979| ) prove that the normalizing constant IciSo, Do) is finite if 



60 > 2 and D^' e Pq. If G is complete (i.e. G has only one clique C = {V} and no separators), 
Wg(So, Dq) reduces to the Wishart distribution Wp(6o, Do), hence its normalizing constant is given 
by 



Ig(6o, Do) = 2^^o^p-^^p/2y^ {(s, + p- l)/2] (det Do) 



-(<5o+p-l)/2 



(4) 



where = uP^p-^^'^ IlfjJ r(a - |) for a > {p - l)/2 (Muirhead 2005 1. If G is decomposable 



but not necessarily complete, Dawid & Lauritzen] (1993]) prove that the G-Wishart distribution 
Wg{5o, Do) can be factorized according to the cliques and the separators of G, hence its normalizing 
constant is equal to ( [Roverato] |2002[ ) : 

nCeclGc(.5o,{Do)c) 



lGiSo,Do) = 



(5) 



nseslGs(5o,{Do)s,) 

The subgraphs Gc and G5 associated with each clique and separator of G are complete, thus 
Icci^o^ (Do)c) and Icsi^o^ (^o)s) are explicitly calculated as in Q. 



2.2 Posterior distributions and the marginal likelihood of a graph 

The likelihood function for /j. and K corresponding with an i.i.d. sample x*^^'"^ = (^x^^\ . . 
X ~ NpQu, K'^) is given by 

L(ii, K\/^'"^) = (2;r)-"^/2(det K)"'^ exp |-^<^, U + n(x - iu)ix - m)^)^ , 



■^"^j from 



(6) 



where ^ = ^ Yj'i=i U = - x)(x^'^ - x)^ . The joint prior ([6j) is conjugate to the likelihood 

(|6|. We assume that the data x'^^'"^ have been centered and scaled to unit variance, so that the 
sample mean of each X, is zero and its sample variance is one. We complete the prior specification 
by taking /io = 0, 5o = 3 and Do = Ip, where Ip is the p-dimensional identity matrix. The 
interpretation of the resulting prior is that the components of X are apriori independent and that the 
posterior "weight" of the prior is equivalent to one observed sample. Other possible choices for 
the G-Wishart prior parameters 60 and Dq are discussed in |Carvalho & Scott] ( |2009| ). 
The marginal likelihood 



Pix^^'-'^IG) 



-- J J L(jx,K\/^'"^)p(ju,K\G)dKd/^ 



associated with a graph G e is 



, ^° 7(if)(det^)— exp --<iS:,t/ + Do)^ Jif, (7) 



where 



J(K) = J expi^-^{K,n(x-iu)ix-id)^+no(ju-iUo)(ju-fiof)^diJ., 

= J expi^-^{K,(n + no)(M-fi)(fi-p.f+A)^dfi. 

with p. = and A = -{n + no)fifi^ + nxx^ + no/ioA'o • After factorizing out exp |-^(^, A)|, the 

remaining integrand is the kernel of the posterior distribution of ji: 

p{fi\/'--''\ K, G) = Np (fi, [(n + noWr') . (8) 

Therefore 



and it follows that the marginal likelihood (|7]) becomes 

(2;r)-T / no Y'^ C ^o«-2 f 1 1 

j\J T^A ^^] (detif)— exp --<if,Do + [/+A) Jif. 

Ig(6o, Do) \n + no I J [ 2 ) 

KePa 

The integrand in the equation above is the kernel of the G-Wishart posterior distribution of K: 

piKl/'-'^K G) = Woido + n,Do + U+ A). (9) 
and therefore, the final form of the marginal likelihood of the data given G is 

M.-'«|G) = (2.)^*(^r«*i^^l^5±f±^. (10) 

\n + no] Ig{Sq,Do) 

A similar argument shows that the posterior predictive distribution of a new sample x^"'^^^ is 
given by 

\n + I + noj Icido + n,Do + U + A) 

where A = -{n + 1 + no)JjJf + x^'^^^^{x^'^^^^Y + (n + no)fijJ.^ and/7 = n+ilnT^^ ■ Since G is assumed 
to be decomposable, the posterior normalizing constant /g(^o + n, Do + U + A) can be calculated 
directly using a formula similar to equation ([s]), hence />(jc^""^'^|G) and p{x^"'^^''\x'-"\ G) can also be 
calculated directly without any numerical approximation techniques. These computations are key 
to a successful implementation of the sampling algorithms we describe in Section]?} 
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3 Dirichlet Process Mixtures of GGMs 



Consider now a mixture of GGMs 



x\{w,}Afi^]AK;],{G*,} - YjW,p(x\mIk;,g*) 



(12) 



1=1 



where p(X\ fu*i,K^,G*i) is given by ([T]). In this model, draws from from X come from one of L poten- 
tially different graphical models; a realization x*^'^ comes from the Z-th graphical model (which is de- 
fined by the parameters fi*i, and Gp independently with probability w/. A fully Bayesian specifi- 
cation of the model is completed by eliciting prior for the parameters ({w/}/~j , {/i,*})^! , {K* , {GJ }J^j). 
A common choice is to set w = (wi, . . . , wl) ~ Dir(w°) and let the component specific parameters 
Qu*i,K^, Gp be i.i.d. samples from some common distribution M. 

Finite mixtures as the one described above allow for additional flexibility over regular GGMs 
by allowing a heterogeneous population to be divided into homogenous groups. However, esti- 
mating finite mixture models involves important practical challenges. For example, in practice 
we generally do not know how many components are present in the population. We could allow 
L to be random and assign a prior distribution to it, but fitting the resulting model involves the 



use of reversible-jump Markov chain Monte Carlo (RJMCMC) methods (Green 1995 ), which are 
notoriously inefficient for high dimensional mixtures. 

As an alternative, this section considers Dirichlet process mixtures of GGMs (GGM-DPM). 



Note that ( 12 ) can be alternative written as 



X\H 



I 



p(,X\ IX, K, G)H(dp, dK, dG) 



//(•) = £w/5(^;,i^;,G;)(-) (13) 



where 6ai-) denotes the degenerate probability measure putting all its mass on a. Therefore, elic- 
iting a prior on ({w/}J^j, {yU^/^i, {GJ}^j) is equivalent to defining a prior on the discrete 



probability measure H, one such prior is the Dirichlet process (Ferguson, 1973 1974). A random 
distribution H is said to follow a Dirichlet process (DP) with baseline measure M and precision 



parameter oq, denoted DP(ao, M), if it has a representation of the form (Sethuraman 1994 ) 

DO 

H(-)-YjW,do;i-), (14) 

1=1 

where 6\,92,-.. are independent and identically distributed samples from the baseline measure 
M and wi = ui ns</(l ~ where U\,U2,--- is another independent and identically distributed 
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sample where u; ~ Beta(l, ffo)- We refer to the joint distribution on (wi,W2, . ■ ■) induced by the 
above construction as a stick breaking distribution with parameter ao, denoted SB(ao)- The DP 
mixture (DPM) model is recovered from ( [T3| ) when H ~ DP{ao,M) for appropriately chosen 
hyperparameters ao and M. 

Consider now an independent and identically distributed sequence 6\,. . . ,6n such that 6j\H ~ 
H, where H ~ DP(ao, M). A useful feature of the DP prior is that the joint distribution for 
{6\,.. .,0n) obtained after integrating out the random H is given by a sequence of predictive distri- 
butions dBlackwell & MacQueenl |1973|) where 6i ~ M and 



6j+i\0j, . . . , 01, ao ~ — ^- — A 



+ 



(=1 



Co + J 



ao + J 



;>i. 



(15) 



The presence of ties in the sequence 6\,. . . sometimes makes it convenient to use an alter- 
native representation where 6\, . . . ,6\ denotes the set of 1 < L < n unique values among di, . . . ,9n 
and ^1, . . . is a sequence of indicator variables such that 9j = 6*^,. Under this representation, 
( fTSj ) implies that 9\,62,.-. is a sequence of independent and identically distributed samples from 
M,^i = 1 and 



V 



^j+il^j, . . . ,^i,ao ~ y — —6i + 
^ ao + j 



1=1 



Q'o + J 



(16) 



where U = max,<y{^,} is the number of distinct values among 6i, . . . , dj, and rj = 1(^,=/) is the 



number of samples among the first / with = I. Expressions ( [15] ) and ( [T6| ) clearly emphasize that, 
for any finite sample x^^-'^\ the number of non-empty components L" = L in a DPM model is a 



random parameter in the model. The prior on L implied by the DP ( Antoniak 1974) is given by : 

L=l,...,n, (17) 



p(L\ao,n) = S{n,L)nlaQ ^ 



ao + n 

where S{-, •) denotes the Stirling number of the first kind. Therefore, the mean number of non- 
empty components grows with ao, the concentration parameter. 



The DP mixture model is intimately connected to the finite mixture model in ( |T2| ). Consider a 
finite mixture with A'^ components such that 

x^^Witv l^} "=i - ^j\ao - Multinom(ao/M . . . , ao/N), ff; ~ M. (18) 

As A'^ oo, the predictive distribution for x^^'"^ under this model converges to the one obtained 



from the DP mixture (Green & Richardson 2001 Ishwaran & Zarepour, 2002). 
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In the GGM-DPM model we explore in this paper we have 6 = Qj.,K, G) and the baseline 
measure is defined by 



M = p(4i,K\G)p{G), 



(19) 



where p{jx, K\G) is given by (|2]) and p{G) oc 1 is the uniform prior on Qy Other choices of priors 



on Qv that encourage sparsity or have desirable multiple testing properties are discussed in [Jones 
etaL] ( |2005] ); [Scott & Berger| ( |2006l ); [Scott & Carvalho| ( |2008] ). 



The GGM-DPM model is a natural extension of the well-known DP mixture of multivariate 



normals originally presented in Miiller et al. ( 1996 1, but the introduction of the component- specific 
graphical structure allows us to induce sparsity in the estimation of the precision matrix associated 
with the mixture components. The point estimates provided by the GGM-DPM model we just de- 
scribed can be interpreted as providing doubly-regularized estimates of the cluster-specific covari- 
ance matrices; one level of regularization arises because of the introduction of the prior distribution 
on the number of components, which introduces a penalty structure on the number of cluster equal 



to the logarithm of ( [T7| ), while the second level of regularization arises because of the introduction 
of the prior p{G) on the graph encoding the cluster-specific conditional independence structure. It 
is well known that, for high dimensional problems, estimation of the covariance matrices {Kj^}\^^ 
can be extremely unstable and that regularized estimators produce improved results; similar ap- 



proaches to regularization have recently proved eff"ective in both graphical models ( [Wainwright 
etal.[[2006| ) and mixture models ( [Fraley & Raftery[|2007| ). 

Although the model just described induces sparsity on the structure of the component specific 
covariance matrices [Ki], the fact that we are using a mixture of GGMs as the data generating model 
means that such sparse structure does not translate into conditional independence for the variables 
involved. Indeed, even if for a given pair j) we have {Ki)ij = for all /, X, and Xj are not 
conditionally independent under the GGM-DPM model. Therefore, all conditional independence 
assumptions derived from the graphs {G/} are valid only conditional on cluster membership. 



4 Computational implementation of GGM mixtures 

As with regular GGM models, the posterior distribution arising from the DDP-GGM model is not 
analytically tractable because of the sheer size of the space of partitions and accompanying graphs. 
Therefore, we resort to MCMC algorithms to explore the features of this complicated posterior dis- 
tribution. The literature on MCMC samplers for the DPM model has grown extensively in the last 
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15 years; the approaches can be roughly divided in three large classes: collapsed (marginal) Gibbs 
samplers ( MacEachern[ 1994 , Escobar & West , 1995 , Neal , 2000 1, which exploit the exchangeabil- 
ity in the data and the Polya urn representation in ([15]) and ([16]) to construct algorithms that avoid 
explicitly sampling H, blocked samplers (jlshwaran & James] |2001^ [Roberts & Papaspihopoulos[ 



2008, Walker, 2007), which explicitly represent the mixing distribution H, and Reversible Jump 



samplers (Jain & Neal 2004[ [2007| ). In this paper we focus attention on marginal samplers such 
as the ones described in |Neal| ([20001) as a natural option that provides some computational advan- 



tages. Indeed, the structure of the baseline measure M in ( [T9| ) is such that we can easily integrate 
the means {//;} and precision matrices {Ki) out of the model and create a sampler that acts on the 
space of partitions and graphs directly, which can dramatically reduce the computational burden. 

Given an initial state where the data x^^-"^ has been divided into L clusters through indicator 
variables ^i, . . . ,^„, and where graphs Gi, . . . , are associated with each of the components, the 
algorithm proceeds to sample from the joint distribution of (L, {^j}"^!, {G/}^p aol-^*^^ "^)- As a first 
stage we update the sequence of indicators {^jTj=i (and, implicitly, the number of components L) 
by sequentially sampling each for j = 1 , . . . , n from its full conditional distribution 

pi^jir^ = {^j']r*j, x^''-"\ {G,}fZ) cx ^ qj,6,, (20) 



1=1 



where 



qji 



oc < 



rJ'pixHx'-^' ) : / ^ J, ^y, = /}, G,), I < L K 



In the previous expression, L"^ is the number of clusters in the sample (excluding observation x^-'^), 
rj-' = Yuj'i^j l{f'=/) is the number of observations included in cluster / (excluding observation j if this 
sample currently belongs to cluster Z), p{x^^^\{x^^"' : / j, = /},G/) is the posterior predictive 
distribution of sample x^^^ given the samples that are currently in the /-th cluster (excluding x''-'^ if 
it happens to belong to this cluster) and the graph G/ associated with this cluster - see equation 
( [TT] ), and p{x^^^\Gi-j+\) is the posterior predictive distribution of sample jc*^^^ given an empty cluster, 
which is calculated by setting n = 0, fi = O(pxi) and U = O(pxp) in equation ( [TT] ). The graph Gl-j+i 
is to be randomly sampled from our baseline measure on ^y, which we labeled p(G) in ( [T9] ). If the 
last observations has been moved out of a cluster, that cluster is deleted and L is decreased by 1 . 
Similarly, if an observation is moved to a new cluster that is currently empty, L is increased by 1 . 

Once the cluster assignment has been updated, the graph G/ associated with each cluster / = 
1, ... ,L is also updated as follows. We let the neighborhood of G/, denoted by nbd^^,(G/), be the 
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set of decomposable graphs that can be obtained from G/ by adding or deleting one edge. These 
neighborhood sets connect any two graphs in Qy through a sequence of graphs that differ by exactly 
one edge - see, for example, |Lauritzen| (| 1996|). We draw a candidate graph G^" from the uniform 
distribution on nbd^j,(G/). We change the graph associated with cluster / to G"''*' with probability 

. ^ ^ p{{x'J^ : = Z}|G;'^"')/|nbd^,(G;'^"')| 
mm< 1 



: ^. = Z}|G/)/|nbd^,(G,)| 

otherwise the graph associated with cluster I remains unchanged. Here p({jc^^^ : = 1}\G) repre- 
sents the marginal likelihood of the samples currently in cluster I given a graph G - see equation 



( [T0| ). We denote by \B\ the number of elements of a set B. To improve mixing, we update the graphs 
associated with each cluster multiple times before another cluster assignment update is carried out 
(typically, between 5 and 10 times seems to provide adequate mixing). 

These two sequences of steps produce a sample from the posterior distribution of interest, 
(L, {^y}"^j, {G/}f^p coIjc^'-"^) without any need to sample the means {yu}f=i or precisions {^If^j. There- 
fore, if we are only interested in inferences about the clustering structure or the graphical structure 
associated with the clusters, or on predictive inference, the previous algorithm is sufficient and can 
dramatically reduce the computational burden of the algorithm. However, if needed, the mean and 
variances of each mixture component can be easily sampled conditional on by noting that 

for every Z = 1 , . . . , L (see equations ([8]) and (|9])) 

Ki\Gu : = 1}- Wg,{6o + r,, Do + t/, + A,), 
H,\KuGu [x^'' : = 1]- N,,Ou/, [(r/ + mWiY'), 

independently of other components. The subscript Z denotes the corresponding values computed 
using only the observations assigned to component Z (for example, r/ is the number of observations 
assigned to component Z). 

Also, additional flexibility can be obtained by sampling some of the hyperparameters associ- 
ated with the DP prior. For example, the concentration parameter uq controls the expected number 
of components, and therefore has an important effect on the inferences generated by the model. 
Since eliciting values for uq can be extremely difficult in practice, it is recommendable to try to in- 
fer it from the data. For example, we can assume a vague G(ao, bo) prior for the precision parameter 
qq, in which case the full conditional distribution can be easily sampled using an auxiliary-variable 



Gibbs sampling step (Escobar & West 1995) (see Appendix) 
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5 More general nonparametric mixtures of graphical models 



The ideas just described for Dirichlet process mixtures of Gaussian graphical models can be di- 
rectly extended to other nonparametric mixture models where Polya urn representations similar 



to ( [T5| ) and ( [T6| ) can be exploited to analytically integrate out the random distributions out of the 
model. Some example include species sampling models ( [McCloskey , 1965, Pitman, 1996, Lijoi 



eTaLj [2007| |Lee et al.j |2009j), hierarchical Dirichlet processes (|Teh et aL| |2006[), nested Dirich- 



let processes (Rodriguez et al. 2008 ) and linear combinations of Dirichlet process (Miiller et al. 



2004|punson et aL||2007|). In this section we consider in detail two such extensions. 



5.1 Species sampling models 



As a first example, consider replacing the Dirichlet process mixture of GGMs with a more general 
species sampling mixture of GGMs. An exchangeable sequence 6\,...,6„ is said to follow a 
species sampling model ( |McCloskeyj [1965^ [Pitmanj |1996t |Lijoi et al.[ |2007| |Lee et al.[ [20091 ) 
with baseline measure M if 6j = 6*^ where 6\,6*2, . . . h di sequence of independent and identically 



distributed samples from M and the indicators , . 
formula 



, ^„ are sampled according to the predictive 



(21) 



1=1 



Vji = 1 for all i and rj and U are defined as in Section 



where the weights Vyi, . . . Vj^u+i satisfy X,=i 
|3] The Dirichlet process is the best known member of the class of species sampling models, which 
also includes the two parameter Poisson-Dirichlet process ( |Pitman[fl996^|Ishwaran & James[|2001| ) 
and the normalized inverse-gamma priors( |L^oi et al. 2005 1, among others. 

Moving beyond DPM models is of interest because the prior on the partition structure induced 
by the Dirichlet process can be somewhat restrictive. For example, ( [T7] ) implies that, a priori and for 
a given precision parameter ao, the expected number of occupied clusters grows with the logarithm 



of n, which might be inappropriate for certain application such as computer vision (Sudderth & 



Jordan 2009[ ). Also, the Dirichlet process favors partitions that consist of a small number of 



clusters with a large number of observations along with a larger number of small clusters. 

Since samples from a species sampling model are exchangeable, the predictive distribution pT] ) 
also provides the full conditional distribution required to implement the Gibbs sampling algorithm 
discussed in Section [3} For example, for the Poisson Dirichlet process with baseline measure M, 
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discount < a < 1 and strength ao > -a we have 



L' j 

r, - a 



ao — aU 



—di + -^Su+i 

,^1 ao + J ao + J 

Note that if or = 0, the Poisson-Dirichlet process reduces to the standard Dirichlet process. 
Modifying the algorithm in Section|3]to fit a Poisson-Dirichlet mixture of GGMs is straightforward. 



In particular, we only need to slightly modify the posterior weights in ( |20| ) to reflect the new prior 
distribution, 

(r;^ - a)p(/JW''^ : / ^ j, ^j, = /}, d), I < L~K 
{ao - aL~J)p{x'J^\GL-j^i), I = L-i + 1. 



qji 



5.2 Infinite Hidden Markov Gaussian Graphical models 

Recently, multivariate time series models that use graphical models to improve estimation of the 



crosssectional covariance structure have been developed (Carvalho & West 2007, Wang & West 



2009). These approaches rely on extensions of the dynamic linear model (DLM) ( [West & Har 
1997| ) and assume that the graph underlying the model is constant in time which, as our 



nson 



first illustration in Section 6.2 demonstrates, might not be an appropriate assumption in practical 



application. As an alternative we focus on a nonparametric version of the popular hidden Markov 
model where the emission distribution corresponds to a GGM. 

Hidden Markov models (HMMs) ( |Cappe et"aL]|2005| ), are hierarchical mixture models where 



x^H(ri}l,A^j]]=„- p{x^%), ~Multinom(7r^-'), ~ Multinom(;r°) 6* - M. 

In this context the latent indicator e { 1 , . . . , L} is called a hidden state, while the entire set of 
indicators is called a trajectory. The ordering of the states is implicitly defined by the order- 

ing of their indices; trajectories evolve according to a Markov process with transition probabilities 
Pr(^y = = r) = n'l . The initial state probabilities are Pr(^o = = Conditionally on a set 
of states {^;}y^i, the observations x^^\ . . . , x'^"\ are independently distributed from state dependent 
distributions p(,-\e*^), . . .,p(-\e*). 



Infinite hidden Markov models (iHMMs) (BealetaL, 2001 Teh et al. 2006; van Gael et al 



2008 1 generalize HMMs to models with an infinite number of states, in a similar way as how 
Dirichlet process models generalize finite mixture models, allowing us to estimate the number of 
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states L. In particular, we can build a GGM-iHMM where 

^y|^y„i,{;r'}~i ~Multinom(;r^-') 
n'\a,y ~ DP(a,Y) 
y\ao ~ SB(Qro), 

and 9*1 = {jxi, Ku Gi) ~ M, where M is defined as in ([T9j). This GGM-iHMM has some distinct 



advantages over the dynamic linear models with graphical structure discussed in CarvaUio & West 



qji = 



( |2007[ ) and |Wang & West| ( |2009| ). In particular, it allows for the graph controlling the conditional 
independence structure of the data to evolve in time while still taking into account the sequential 
nature of the problem. 

Again, a marginal Gibbs sampler similar to the one described in Section |3] can be devised for 
the GGM-iHMM. We denote by rj^,'-'^ the number of transitions from state / to state Z' in the sub- 
trajectory {^jYjlj^ and by rj"^" the number of transitions out of state I in the same sub-trajectory. 
Given the base DP parameters y = {yi, . . . , yt+i) and the precision parameters ao and a, the states 
^1, . . . , ^„ are sequentially updated using the full conditional distributions: 

{r^^-;^ + r^:;f" + ay,) ^'"'^u-^^^^J.^^^' P(^V^ : / * h^r = l)^Gd, I < L-J,l^ 

ay,y^^^,p{x^J\Gt^,\ I = L'i + 1. 

If a new empty cluster needs to be created, we update the number of clusters L by setting = 
L + I and the vector y by setting y^t'^ = vyi+i, 7^+2 = (1 ~ v)7l+i were v ~ Beta(ffo, 1). 

Given a trajectory ffo and a, we sample y by introducing the independent auxiliary 

variables {mw } for Z, /' 6 { 1 , . . . , L} such that 

Pr(m,, = m) 5 irf"\ mXayiT, m 6 { 1 , . . . , 4!"^}, 

where 5(-, •) denotes the Stirling number of the first kind. Conditional on these auxiliary variables 
we can update y by sampling 

(yi,...,yz.+i) ~ Dir(m.i,...,m.i,ao), 

where m./- = "^//'- We use vague gamma priors for the precision parameters ao and or and 
update them as described in the Appendix. 
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6 Examples 



6.1 Simulated data 

We consider first a small simulation that involves data arising from a two Gaussian clusters. For 
brevity, the results we present in this Section correspond to a single run of the simulation, but these 
are representative of those obtained over multiple runs. In the first cluster samples are from a star 
graphical model NwiO-5, K^^) with every variable Xj, j > 2, connected to Xi, while the second 
cluster contains 100 samples from a cycle model A^io(-0.5, ^2 ^). The non-zero elements of the 
two precision matrices are 

(K,)jj = (K2)jj=h 7= 1,...,10, 

(K,hj = iK,)jj=03, J = 2,..., 10, 

(K2)j-uj = mjj-i = 0.3, J = 2, . . . , 10, 

(^2)1,10 = (^2)10,1 = 0-3. 

We are interested in recovering the two clusters and their corresponding conditional independence 
graphs. Since the first 100 observations belong to the first cluster and the remainder to the second 
cluster, we are also able to employ the iHMM discussed above and compare its performance to the 
DPM model. After sampling a dataset, we ran both DPM and iHMM samplers for 20,000 itera- 
tions after 5,000 iterations of burn-in. To determine the utility of including GGMs in the Dirichlet 
Process framework we also ran both the DPM and iHMM models with attention restricted to the 
full graph. 

Figure [T] shows the clustering for the DPM and iHMM both using the full space of decom- 
posable GGMs and with attention restricted to the full graph. The upper lefthand panel, which 
corresponds to the results from the DPM model using the full graph space, shows two well defined 
groups with a few observations incorrectly classfied. The upper righthand panel shows that restrict- 
ing attention to the full graph decreases the ability to cluster observations properly. The bottom 
row of Figure [T] shows that using the iHMM model in this case dramatically improves clustering. 
The iHMM model using the unrestricted graph space clusters the observations perfectly, while the 
iHMM model using the full graph does nearly as well, though it misclassifies one observation. 

Figure |2] shows the edge probabilities in each of the two clusters for the DPM and iHMM mod- 
els, along with the true edges from the underlying model. We see that for the most part, the edges 
with the highest probability in both models correspond to the true edges. However, the improved 
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clustering capabilities of the iHMM model translates into sharper edge probabilities, especially in 
the second cluster. 



6.2 Real data: understanding trends in exchange rate fluctuations 

First, we consider a dataset that follows the exchange rate of 1 1 currencies relative to the US dol- 
lar between November 1993 and August 1996. This dataset consists of 1000 daily observations 
and includes three Asian currencies (NZD,AUD, JPY), five European currencies that eventually 
became part of the Euro (DEM, FRF, BEE, NLG, ESP) and three additional European currencies 
(GBP, SEK, CHE). These data have previously been used in a variety of contexts related to graph- 
ical models (see e.g. [Carvalho et alj [20071 and [Carvalho & Westl[2007] ). In [Carvalho etaL] ( |2007l ) 
the authors present the graph shown in Figure |3} which is determined using stochastic search meth- 
ods first discussed in [Jones et al. (2005 1 over the final 100 timepoints. The authors note that this 
graph is sensible from the standpoint of known trading relations: the mainland European countries 
that join the Euro are closely linked in a single clique, while the British Pound (GBP), Swedish 
Krona (SEK) and Swiss Franc (CHZ) connect with only some of these counties, most notably the 
currencies of the largest Euro-area countries, the Deutsch Mark (DEM) and French Franc (FRF) 
(the Swiss Franc is also connected to the Netherlands Gilder (NLG), being more integrated with 



mainland European economies). [Carvalho & West (2007 1 then show that portfolio weights based 
on estimates from this graphical model give an investment strategy with increased return and re- 
duced variability when compared to using an approach that does not impose graphical structure on 
the estimates of S, evidence of the effectiveness of the graphical models approach. 

We used the exchange dataset to explore the possibility of alternating regimes with separate 
patterns of interaction during these 1000 days. Given that the data form a natural timecourse, we 



employed the GGM-iHMM model discussed in Section 5.2 We ran the model for 100000 iter- 
ations after a bum-in period of 20000 iterations, and ran five separate instances of the algorithm 
from separate starting points. After completion we assessed the results from each chain and veri- 
fied they returned the same estimates. Figure |4] shows the convergence in a and ao across chains. 
When run using the iHMM model, the observations for the most part clearly fall into one of two 
regimes. Figure [5] shows the posterior probabilities that two observations belong to the same state. 
The first state is comprised roughly of the time points one through 251 (1 1/14/1993 to 7/21/1994) 
at which point a second regime takes over. Interestingly the first interaction regime arises again for 
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Figure 1: Probability that two observations belong to the same cluster in the simulated example. 
The panel shows, in a clockwise order, the results using the DPM model in the unrestricted case, 
the DPM model restricting to the full graph, the iHMM model restricting to the full model and the 
iHMM model in the unrestricted case. Note that the true underlying clustering is shown by the 
iHMM model in the unrestricted case. 
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Figure 2: Edge probabilities in the simulated example. In each panel, the upper triangle shows 
the average edge probability of observations in the first cluster, while the lower triangle shows the 
average edge probability of observations in the second cluster. The lefthand panel corresponds 
to the results from the DPM search, the middle panel corresponds to the iHMM model and the 
righthand panel displays the true edges in the underlying model. 

a brief period roughly comprising timepoints 623 to 700 (7/29/95 to 10/14/95). These two regimes 
show a good deal of similarity in the associated graphical models, but also present some key dif- 
ferences. Figure [6] displays the graphical model associated with timepoint 40 (belonging to the 
first regime) and timepoint 540 (belonging to the second). The graphs shown are assembled from 
those edges that had a greater than 80% chance of inclusion for the respective observation, which 
is similar in spirit to displaying the top graphical model from a stochastic search, as performed by 
CarvalhoetaL] ( |2007] ). 

In the first regime, an association structure broadly consistent with the graph used in Carvalho 



and West [Carvalho & West] ( |2007| ) is present. We see a tight grouping of the Euro adopters (not 
quite a clique, but missing only an ESP NLG connection), but with increased connection between 
the Swiss Franc and the Euro countries. A clique amongst the Asian currencies is connnected to 
only three of the Euro adopters. Furthermore the Pound is only connected to the Euro through 
DEM, the currency of the economic leader of this area. The interpretation of this graph is similar 
to that reported earlier: the fluctuations in the exchange rate of mainland European currencies to 
the dollar roughly track one another. However, at this point the British Pound was no longer part of 
the European Exchange Rate Mechanism (ERM), following the Pound's crash on "Black Wednes- 
day", September 16th of 1992. This reason, along with the greater integration of trade between 



Britain and the US (as suggested by Carvalho et al Carvalho et al. 2007 1, leads to a separation of 
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Figure 3: Graphical model presented by Carvalho et al |Carvalho et al.| ( f2007| ), which represents 
the highest probability graph found using stochastic search methods over the last 100 timepoints 
of the exchange dataset, using the author's prior specifications. This graph has been used to show 
that investment strategies based on graphical models often have lower variability and higher yield 
than methods based on the full covariance matrix. 
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Figure 4: Convergence plot for a and ao across chains by log iteration. This plot shows the running 
average of these two parameters across five separate instances of the GGM-iHMM algorithm. Their 
mutual agreement implies the settings used are sufficient to assure convergence. 




Figure 5: Heatmap displaying probability that two observations belong to the same cluster for the 
exchange example. The figure shows two dominant regimes, one that runs for the most part from 
timepoints 1 to 251 and again roughly between timepoints 623 and 700 and the other which is 
present most of the remaining time periods. 
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Figure 6: Graphical models associated with timepoint 40 (left) and timepoint 540 (right) in the 
exchange rate example. These graphs were constructed by adding any edge that had greater than 
80% posterior inclusion probability for the respective timepoint. Such a high threshold was cho- 
sen in order to make a suitable comparison to the graph shown in Figure |3} which is the highest 
probability graph determined using stochastic search methods. 

the Pound from the mainland European currencies. 

The second regime has a similar structure but a markedly different interpretation of the interac- 
tions between the Pound and the smaller Euro adopters. At this point, the graph still is comprised 
of a large group consisting of the Euro countries, however the Pound has joined-and become a cen- 
tral part of-this grouping. It connects with each member of the Euro countries (as well as CHE) 
and subsequently connections between ESP and DEM and BEE and CHE are no longer present. 
Furthermore, the Asian countries lose two neighbors, CHE and BEE, and the AUD is now the sole 
currency connecting the Asian countries to both the DEM and GBP, though the JPY maintains its 
association with DEM. 

The greater connectedness of the GBP to the euro area may have been a result of the uncer- 
tainty regarding the specifics of the Euro's implementation in the mid-nineties. In particular, the 
crash of the Pound in 1992 and Britain's subsequent withdrawal from the ERM left a looming 
uncertainty regarding if, and when, Britain would again agree to join the common currency. The 
initial switch from a "UK-excluding" to a "UK-inclusive" regime in the exchange rate data occurs 
on July 22, 1994. What is curious about this date is that Tony Blair was elected to lead the Labour 
party on July 21, 1994. Blair would eventually run a campaign based, in part, on rejoining the ERM 
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and adopting the Euro, a stance he held until the events of 2001 . The graphical models displayed in 
Figure [6] suggest that currency markets began integrating the possibility that the Pound would join 
the Euro by exhibiting greater covariation with mainland European countries. This new regime 
was, itself, somewhat unstable, as evidenced by the return of a "UK-exclusive" regime during the 
summer of 1995. 



Carvalho & West (2007) use the exchange rate data to show that minimum variance portfolios 



will yield better return when GGMs are employed to estimate covariances. We considered a similar 
analysis and used the expectation over the sampled values of ju^+i and Kj+i for each T between 
20 and 300 as the first two moments of the predictive distribution and calculated portfolio weights 



wt assuming a target return of m = 0.1% per day (see [Carvalho & West| ( |2007[ ) for details regard- 
ing the construction of these portfolios). Figure [7] shows that the predictive distributions from the 
GGM-iHMM model give portfolio weights that have higher yields than both the iHMM only model 
and the GGM only model. This shows the ulitility of our framework: by considering a mixture 
model we are able to adapt to changing conditions, thereby leading to better specified predictive 
distributions. Furthermore, by incorporating the GGMs into the model formulation, we are able to 
induce sparser estimates of covariation, which likewise improve predictive performance. 



7 Discussion 



Although this paper has focused on two relatively simple models (nonparametric mixtures of 
GGMs and infinite hidden Markov models with GGM emission distributions), the basic struc- 
ture can be employed to generalize many other nonparametric models. For example, we plan to 
extend the nonparametric mixture classifier developed in Rodriguez & Vuppala ( |2009| ) to include 
GGM kernels as a way to improve classification rates in high-dimensional problems. Also, in the 
spirit of IMtiUer et al.| ( |T996l ), |Muller et al.| ( |2004[ ) and [Rodriguez et all] ( |2009| ), sparse nonlinear 
regression models can be generated by using GGM mixtures as the joint model for outcomes and 
predictors, from which the regression function can be derived by computing the conditional ex- 



pectation of the outcome given the predictors. This generalizes the work of Dobra et al. (2008 ) on 
sparse regression to allow for adaptive local linear fits. 

The implementation of GGM mixtures we have discussed in this paper exploits the Pdlya urn 
representation available for many nonparametric models to construct a Gibbs sampler that updates 
the grouping structure one observation at a time. This has allowed us to avoid the explicit repre- 
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Figure 7: Cumulative returns from forming the optimal portfolio weights based on running the full 
GGM-iHMM model as well as the iHMM only and GGM only models, run over time periods 20 
to 300. The final cumulative return was 17.2% for the GGM-iHMM, 13.5% for the GGM only and 
11.2% for the iHMM only models. 
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sentation (and the sampling) of means and covariance parameters, which can be computationally 
intensive. However, Polya urn samplers can suffer from slow mixing and we plan to explore in the 
near future alternative computational algorithms, in particular those employing split-merge moves 
such as those developed in [Jain & N"eaT| ( |2004| ) and |Jain & Neal| ( |2007| ). 

The data analyses in this paper suggest that implementing mixtures of GGMs using Markov 
chain Monte Carlo algorithms is feasible for a moderate numbers of variables. However, many 
interesting applications of GGM mixtures (e.g., gene-expression data) involve outcomes in much 
higher dimensions, where previous experience suggest that random-walk MCMC algorithms will 
be inefficient. We are currently exploring deterministic search algorithms based on heuristics, such 



as the feature-inclusion ( [Berger & Molina[ |2005^ [Scott & Carvalho[|2008] ), that might allow us to 
identify high-probability partitions and their associated graphs. 

Our framework extends to include nondecomposable graphs in a straightforward manner. The 
only difference comes in the implementation of our sampling algorithms: for nondecomposable 



graphs the marginal likelihood (10) and the predictive distribution ( 11 1 must be numerically ap- 
proximated instead of being calculated directly through formulas. We are currently working on 
this extension based on the recent developments from Lenkoski & Dobra ( 2010[ ). 
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Appendix 

We give a brief description of an auxiliary variable scheme for sampling from the posterior distri- 
butions of the single concentration parameter otq in Section|4]and the two concentration parameters 
a and ao from Section [S!2] - see [Escobar & West| ( |1995| ) and [Teh et al.[P006l ) for full details. We 



assume that the priors for a and ao are G(a, b) and G(ao, bo), respectively. 

In the case of the GGM-DPM, we sample ao by introducing an auxiliary variable 77. Conditional 
on oro, we have ?7|ffo ~ Beta(Q'o + ^,n). Conditional on 77, uq follows a mixture distribution 

Qrol?7 ~ dr^<G{aQ + L,bQ -logT]) + {\ - d,j)Giao + L-l,bo-log if), 

where d,^l{\ - dr,) = {ao + L- l)l[n{bo - log(77))]. 
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In the case of the GGM-iHMM we additionally introduce auxiliary variables . . . and 
ui,...,ul. Conditionally of a, qi\a ~ Beta(Qr + l,r/.) and ui\a ~ Bernoulli(r/./((3^ + n)), where 
^/ = J!if'=i ^iv- Then, a is sampled from its full conditional distribution. 



where m.. = Z^=i f^w- To sample otq, we follow a procedure that is very similar to the one we 
used for the GGM-DPM. Again, we introduce an auxiliary variable 77. Conditional on ao, we have 
r]\ao ~ Beta(a?o + l,m..). Conditional on rj, ao follows a mixture distribution 

aoln ~ dr,G(ao + L, Zjq - log 77) + (1 - dr,)G(aQ + L - 1, Zjq - log 77), 

where J^/(l - dr,) = (ao + L- l)/[m..(bo - log(/7))]. 
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